Lightweight LCP construction for very large collections of strings

نویسندگان

  • Anthony J. Cox
  • Fabio Garofalo
  • Giovanna Rosone
  • Marinella Sciortino
چکیده

The longest common prefix array is a very advantageous data structure that, combined with the suffix array and the Burrows-Wheeler transform, allows to efficiently compute some combinatorial properties of a string useful in several applications, especially in biological contexts. Nowadays, the input data for many problems are big collections of strings, for instance the data coming from “next-generation” DNA sequencing (NGS) technologies. In this paper we present the first lightweight algorithm (called extLCP) for the simultaneous computation of the longest common prefix array and the BurrowsWheeler transform of a very large collection of strings having any length. The computation is realized by performing disk data accesses only via sequential scans, and the total disk space usage never needs more than twice the output size, excluding the disk space required for the input. Moreover, extLCP allows to compute also the suffix array of the strings of the collection, without any other further data structure is needed. Finally, we test our algorithm on real data and compare our results with another tool capable to work in external memory on large collections of strings.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Critique "Lightweight LCP Construction for Next-Generation Sequencing Datasets"

The paper presents the rst lightweight method that simultaneously computes, the longest common pre x array(LCP) and BWT of very large collections of sequences. Knowing the LCP of DNA sequences collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-e cient algorithms for computing the LCP of a string have been describ...

متن کامل

A New Lightweight Algorithm to compute the BWT and the LCP array of a Set of Strings

Indexing of very large collections of strings such as those produced by the widespread sequencing technologies, heavily relies on multi-string generalizations of the BurrowsWheeler Transform (BWT), and for this problem various in-memory algorithms have been proposed. The rapid growing of data that are processed routinely, such as in bioinformatics, requires a large amount of main memory, and th...

متن کامل

Lightweight LCP Construction for Next-Generation Sequencing Datasets

The advent of “next-generation” DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing...

متن کامل

Computing the BWT and LCP array of a Set of Strings in External Memory

Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multi-string generalization of the Burrows-Wheeler Transform (BWT): recent developments in this field have resulted in external memory algorithms, motivated by the large requirements of in-

متن کامل

Fast and Lightweight LCP-Array Construction Algorithms

The suffix tree is a very important data structure in string processing, but it suffers from a huge space consumption. In large-scale applications, compressed suffix trees (CSTs) are therefore used instead. A CST consists of three (compressed) components: the suffix array, the LCP-array, and data structures for simulating navigational operations on the suffix tree. The LCP-array stores the leng...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • J. Discrete Algorithms

دوره 37  شماره 

صفحات  -

تاریخ انتشار 2016